| Functions.1 | Functions.2 | Functions.3 |
|---|---|---|
| check_acc | desc_df | find_outliers |
| quickplot | remove_outliers | RSS |
| show_df | smart_round | |
Temperature data option calculation
1 Weather Derivatives Temperature Options
Full dependencies list
Functions loaded from main
Packages required to run this file
Show code
required_packages(file)

| Required.Packages.1 | Required.Packages.2 | Required.Packages.3 |
|---|---|---|
| caret | dplyr | DT |
| e1071 | fBasics | forecast |
| ggplot2 | gt | gtExtras |
| leaflet | lubridate | MASS |
| nlme | NMOF | PerformanceAnalytics |
| plotly | quantmod | reshape2 |
| rugarch | splines | stats4 |
| tidyr | timeDate | timeSeries |
| TTR | xts | zoo |
Functions defined in this document
| Required.Functions.1 | Required.Functions.2 | Required.Functions.3 |
|---|---|---|
| apply_convolution | sin_component | create_spline_plot |
Show code
cat("time of creation", "\n")
time of creation
Show code
print(file.info(file)$ctime)
[1] "2024-08-27 15:52:39 GMT"
Show code
cat("LAST MODIFICATION", "\n")
LAST MODIFICATION
Show code
print(file.info(file)$mtime)
[1] "2025-05-06 15:27:18 GMT"
Show code
cat("Last Access", "\n")
Last Access
Show code
print(file.info(file)$atime)
[1] "2025-05-06 15:27:18 GMT"
2 Climate data download
Much of the weather market is dominated by temperature derivatives, designed to protect the holder from unexpected temperature outcomes. The underlying used to trade these contracts is either Heating Degree Days (HDD) or Cooling Degree Days (CDD). For a specific day n, these can be defined as:
\(HDD_n = \max(T_{ref} - T_n, 0)\)
\(CDD_n = \max(T_n - T_{ref}, 0)\)
The most popular temperature instruments traded are options, whose payoff depends on the cumulative sum of these daily indices over a longer period, usually an entire season:
| Option Type | Protection Against | Exercise When | Payout | Example |
|---|---|---|---|---|
| HDD call | Overly cold winters | HDD > K | \(\alpha \cdot\) (HDD - K) | Farmers |
| HDD put | Overly warm winters | HDD < K | \(\alpha \cdot\) (K - HDD) | Ski resorts |
| CDD call | Overly hot summers | CDD > K | \(\alpha \cdot\) (CDD - K) | Utilities |
| CDD put | Overly cold summers | CDD < K | \(\alpha \cdot\) (K - CDD) | Beaches |
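The definitions and payoffs above can be sketched in a few lines of R. The reference temperature of 18 C is a common convention, but the strike `K`, tick size `alpha`, and the toy temperatures are illustrative values, not taken from any actual contract.

```r
# Illustrative sketch of the degree-day definitions and an HDD call payoff.
# T_ref = 18 C is a common reference level; K (strike) and alpha (tick size)
# are made-up example values.
hdd <- function(t_avg, t_ref = 18) pmax(t_ref - t_avg, 0)
cdd <- function(t_avg, t_ref = 18) pmax(t_avg - t_ref, 0)

# Payoff of an HDD call on the cumulative seasonal index
hdd_call_payoff <- function(t_avg, K, alpha = 1) {
  alpha * max(sum(hdd(t_avg)) - K, 0)
}

daily_temps <- c(2, 5, -1, 10, 20)    # toy winter temperatures (deg C)
sum(hdd(daily_temps))                 # cumulative HDD: 16 + 13 + 19 + 8 + 0 = 56
hdd_call_payoff(daily_temps, K = 50)  # index exceeds the strike by 6
```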
NASA Prediction Of Worldwide Energy Resources (POWER) | Data Access Viewer (DAV) [LINK]
The dataset comes from the POWER Agroclimatology community archive; citing the exact source and parameters is important for reproducibility in climate research.
This study utilizes temperature data obtained from the MERRA-2 (Modern-Era Retrospective analysis for Research and Applications, Version 2) dataset, which provides detailed meteorological measurements at a spatial resolution of 0.5 x 0.625 degrees latitude/longitude.
The specific region analyzed, with an average elevation of 288.19 meters, is located at a latitude of 45.5330, and longitude of 9.1911.
It is important to note that the dataset uses a value of -999 to denote missing data, either when a parameter cannot be computed or when it falls outside the range of available source data. I removed these values and replaced them with the mean of the overall dataset.
Parameter(s):
T2M_MAX = Temperature at 2 Meters Max for the day (C)
T2M_MIN = Temperature at 2 Meters Min for the day (C)
T2M_AVG = \(\frac {T_{MIN}+T_{MAX}}2\) Temperature at 2 Meters Average for the day (C)
Temperature at 2 meters refers to the air temperature measured 2 meters (approximately 6.5 feet) above the ground. This is the standard height for temperature measurements used in meteorology because it minimizes the effects of ground heating or cooling, providing a more accurate reflection of the ambient air temperature experienced by humans and relevant to various economic activities.
Why? Temperature at 2 meters is commonly used as the basis for weather derivatives. This is because:
Standardization: Temperature at 2 meters is a standardized and widely available data point, making it reliable for use in financial contracts.
Relevance: Many weather-related risks, such as heating degree days (HDD) and cooling degree days (CDD), are calculated based on temperatures at this height. These metrics are commonly used in weather derivatives to hedge against risks related to heating and cooling needs.
Accuracy: Because it represents the temperature most relevant to human activities and energy consumption, it is directly applicable for creating and settling weather derivatives that are based on temperature-related indices.
Show code
ORIGINAL_DATASET <- read.csv("C:/Users/pietr/OneDrive/Desktop/POWER_Point_Daily_19810101_20250228_045d53N_009d19E_LST.csv", skip = 10)
DATASET <- ORIGINAL_DATASET %>%
  # flag outlier readings, fill them with NA, then interpolate linearly
  mutate(T_MAX = remove_outliers(T2M_MAX, fill = "NA") %>% na.approx) %>%
  mutate(T_MIN = remove_outliers(T2M_MIN, fill = "NA") %>% na.approx) %>%
  # rebuild the calendar date from YEAR and day-of-year (DOY)
  mutate(DAY = as.Date(DOY - 1, origin = paste0(YEAR, "-01-01"))) %>%
  mutate(Month = month(DAY)) %>%
  select(c(DAY, YEAR, Month, DOY, T_MAX, T_MIN))
# daily average as the midpoint of max and min
DATASET$T_AVG <- apply(DATASET[c("T_MAX", "T_MIN")], 1, mean)
ORIGINAL_DATASET[ifelse(find_outliers(ORIGINAL_DATASET$T2M_MAX)==0,FALSE,TRUE),] %>%
gt() %>%
opt_stylize(5) %>%
  tab_header(title = "Days where the sensor malfunctioned", subtitle = "identified by remove_outliers")

| Days where the sensor malfunctioned | | | |
| identified by remove_outliers | | | |
| YEAR | DOY | T2M_MAX | T2M_MIN |
|---|---|---|---|
Show code
datatable(smart_round(DATASET, digits = 3))
2.1 Initial data visualization
Show code
cleandataset <- DATASET %>%
select(c(T_MAX, T_MIN, T_AVG)) %>%
xts(order.by = DATASET$DAY)
desc_df(cleandataset)

| | n | mean | sd | median | trimmed | min | max | range | skew | kurtosis | se | %NA | Q.1% | Q.25% | Q.75% | Q.99% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| T_MAX | 16130 | 17.8776 | 8.3745 | 18.090 | 17.9393 | -5.550 | 39.03 | 44.580 | -0.0536 | -1.0144 | 0.0659 | 0 | 1.7029 | 10.86 | 24.87 | 33.5771 |
| T_MIN | 16130 | 8.8073 | 7.2250 | 8.690 | 8.7750 | -14.940 | 26.25 | 41.190 | 0.0233 | -1.0501 | 0.0569 | 0 | -4.4871 | 2.66 | 15.00 | 22.1571 |
| T_AVG | 16130 | 13.3424 | 7.6958 | 13.365 | 13.3438 | -8.875 | 32.05 | 40.925 | -0.0046 | -1.0572 | 0.0606 | 0 | -0.9935 | 6.75 | 19.89 | 27.5435 |
Show code
cleandataset %>%
ggplot(aes(x=index(.)))+
geom_line(aes(y=T_MAX, color = "T_MAX"))+
geom_line(aes(y=T_MIN, color = "T_MIN"))+
geom_line(aes(y=T_AVG, color = "T_AVG"))+
labs(title = "Last 43 years of recorded data", y="Temperature", x=NULL)+
scale_color_manual(name = "Temps",
                     values = c(T_MAX = "indianred3", T_MIN = "lightblue", T_AVG = "lightgreen"))
However, the full history is hard to read at this scale, so let me zoom in on the last ten years of the database (2015 onward).
Show code
NDAYS <- nrow(cleandataset)
lookback <- 365*10
{tail(cleandataset, lookback) %>%
as.data.frame() %>%
mutate(year = index(tail(cleandataset, lookback))) %>%
ggplot(aes(x = year))+
geom_line(aes(y=T_MAX, color = "T_MAX"))+
geom_line(aes(y=T_MIN, color = "T_MIN"))+
geom_line(aes(y=T_AVG, color = "T_AVG"))+
labs(title = "Last 10 years of recorded data", y="Temperature", x=NULL)+
scale_color_manual(name = "Temps",
                     values = c(T_MAX = "indianred3", T_MIN = "lightblue", T_AVG = "lightgreen"))} %>%
  ggplotly()
Show code
# animplot1 <- plot1 +
# labs(subtitle = "year: {frame_time}") +
# transition_reveal(year) +
# view_follow(fixed_y = TRUE)
#
# animate(animplot1, nframes = 20, renderer = gifski_renderer(), fps = 30, duration = 10, end_pause = 60, res = 100)

2.2 Seasonal analysis
Now I can look at the distribution of the data. First I need to split it into summer and winter periods, using the day of the year so that:
Winter ~ October-March
Summer ~ April-September
Show code
DATASET_seas <- DATASET %>%
group_by(Month < 4|Month>9) %>%
rename("Season"="Month < 4 | Month > 9") %>%
mutate(Season = ifelse(Season, "Winter", "Summer"))
DATASET_seas[-1] %>%
xts(order.by = DATASET$DAY)%>%
show_df() %>%
gt() %>%
tab_header(title = "Temperature Overview") %>%
opt_stylize(style = 5, add_row_striping = TRUE) %>%
cols_align(align = "center") %>%
  sub_missing(columns = everything(), missing_text = "⋮")

| Temperature Overview | | | | | | | |
| DATE | YEAR | Month | DOY | T_MAX | T_MIN | T_AVG | Season |
|---|---|---|---|---|---|---|---|
| 1981-01-01 | 1981 | 1 | 1 | 8.96 | 1.05 | 5.005 | Winter |
| 1981-01-02 | 1981 | 1 | 2 | 8.04 | -0.20 | 3.920 | Winter |
| 1981-01-03 | 1981 | 1 | 3 | 8.21 | 0.49 | 4.350 | Winter |
| 1981-01-04 | 1981 | 1 | 4 | 11.11 | 1.48 | 6.295 | Winter |
| 1981-01-05 | 1981 | 1 | 5 | 6.89 | -2.12 | 2.385 | Winter |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 2025-02-24 | 2025 | 2 | 55 | 11.62 | 6.12 | 8.870 | Winter |
| 2025-02-25 | 2025 | 2 | 56 | 9.84 | 7.97 | 8.905 | Winter |
| 2025-02-26 | 2025 | 2 | 57 | 12.52 | 6.05 | 9.285 | Winter |
| 2025-02-27 | 2025 | 2 | 58 | 13.01 | 3.53 | 8.270 | Winter |
| 2025-02-28 | 2025 | 2 | 59 | 11.63 | 3.37 | 7.500 | Winter |
Show code
winter_dataset <- DATASET_seas[DATASET_seas$Season=="Winter",]
summer_dataset <- DATASET_seas[DATASET_seas$Season=="Summer",]
cat(paste0("The difference in number of rows is approximately: ", round(nrow(summer_dataset) / nrow(winter_dataset) - 1, 4)*100, "%", ", or ", round((nrow(summer_dataset) / nrow(winter_dataset) - 1)*NDAYS, 1), " days"))
The difference in number of rows is approximately: -0.32%, or -51.9 days
Show code
grid.arrange(ncol=2,
ggplot(winter_dataset)+
geom_histogram(aes(x = T_MAX, fill = "T_MAX"), alpha=0.8, bins = 80)+
geom_histogram(aes(x = T_MIN, fill = "T_MIN"), alpha=0.8, bins = 80)+
geom_histogram(aes(x = T_AVG, fill = "T_AVG"), alpha=0.8, bins = 80)+
labs(title = "Distribution charts of winter",x=NULL)+
scale_fill_manual(name = NULL,
values = c(T_MAX = "indianred3",T_MIN = "lightblue",T_AVG = "lightgreen"))+
theme(legend.position = "bottom")
,
ggplot(summer_dataset)+
geom_histogram(aes(x = T_MAX, fill = "T_MAX"), alpha=0.8, bins = 80)+
geom_histogram(aes(x = T_MIN, fill = "T_MIN"), alpha=0.8, bins = 80)+
geom_histogram(aes(x = T_AVG, fill = "T_AVG"), alpha=0.8, bins = 80)+
labs(title = "Distribution charts of summer",x=NULL)+
scale_fill_manual(name = NULL,
values = c(T_MAX = "indianred3",T_MIN = "lightblue",T_AVG = "lightgreen"))+
theme(legend.position = "bottom")
)

With this choice of season split the data stays well balanced, and the two seasonal distributions are clearly distinct.
Lastly, I can overlay just the average temperature (T_AVG) of winter and of summer on the same chart.
Show code
ggplot()+
geom_histogram(data = winter_dataset, aes(x = T_AVG, fill = "Winter"), alpha=0.8, bins = 80)+
geom_histogram(data = summer_dataset, aes(x = T_AVG, fill = "Summer"), alpha=0.8, bins = 80)+
labs(title = "Distribution charts of the 2 averages",x=NULL)+
scale_fill_manual(name = NULL,
values = c(Winter="steelblue", Summer="orange"))+
  theme(legend.position = "bottom")

To understand the records in my dataset, I wanted to see which months contained the hottest and coldest days across the whole history.
Show code
DATASET_seas %>%
group_by(Month) %>%
summarise(T_min_min = min(T_MIN),
T_min_max = max(T_MIN),
T_max_min = min(T_MAX),
T_max_max = max(T_MAX)) %>%
round(3) %>%
mutate(Month = month.abb) %>%
gt() %>%
opt_stylize(5) %>%
tab_header("Data exploration", "what is the highest and lowest in my database") %>%
cols_label(T_min_min = "Min",
T_min_max = "Max",
T_max_min = "Min",
T_max_max = "Max") %>%
tab_spanner(label = "T_MIN", columns = 2:3) %>%
  tab_spanner(label = "T_MAX", columns = 4:5)

| Data exploration | | | | |
| what is the highest and lowest in my database | | | | |
| Month | T_MIN | | T_MAX | |
| | Min | Max | Min | Max |
|---|---|---|---|---|
| Jan | -14.94 | 7.75 | -5.55 | 20.43 |
| Feb | -12.11 | 9.39 | -3.22 | 21.25 |
| Mar | -6.44 | 11.42 | 0.10 | 24.74 |
| Apr | -2.41 | 14.99 | 6.40 | 29.46 |
| May | 2.82 | 20.05 | 11.28 | 31.17 |
| Jun | 6.98 | 25.93 | 12.24 | 38.17 |
| Jul | 10.45 | 25.60 | 19.44 | 38.76 |
| Aug | 8.67 | 26.25 | 19.18 | 39.03 |
| Sep | 5.14 | 22.04 | 13.48 | 32.65 |
| Oct | -1.39 | 17.52 | 5.43 | 31.23 |
| Nov | -4.44 | 15.37 | 1.36 | 21.93 |
| Dec | -10.77 | 10.09 | -3.70 | 16.59 |
2.3 Long term trends
Show code
SMA(DATASET$T_AVG, lookback) %>% quickplot(show_legend = F, title = "Long term trend")
Show code
runSD(DATASET$T_AVG, lookback) %>% quickplot(show_legend = F, title = "Long term Standard deviation")
3 Seasonal decomposition
The first step in pricing temperature options is to decompose the time series into distinct components that each represent a different underlying pattern. This process is essential for understanding the structure of the data and for developing predictive models. In basic terms the series can be expressed as the sum of three main components: \(y_t = T_t + S_t + e_t\). Here, \(T_t\) is the trend component that captures the long-term movement, \(S_t\) represents the seasonal variations (periodic patterns), and \(e_t\) refers to the residuals or noise in the data.
which in our case becomes \[\overline{T_t} = T_{trend} + T_{seasonal} + Residuals\] The approach was to develop each component independently.
\[ T_{trend} = a + b t \]
The trend component could be as simple as a straight line or a moving average.
The seasonal component captures the cyclical nature of temperature changes throughout the year, like daily or annual temperature fluctuations:
\[ T_{seasonal} = \alpha \cdot \sin(\omega \cdot t + \theta) \]
\(\alpha\) represents the amplitude
\(\omega\) is the angular frequency (related to the period of the cycle)
\(\theta\) is the phase shift
And lastly the residuals are just: \[Residuals = T_t - (T_{trend} + T_{seasonal})\]
Recent research has suggested that partial Fourier decomposition can be effective in modeling temperature data, particularly by omitting the cosine component. This approach simplifies the seasonal model while still accurately representing the cyclical nature of temperature changes. The sinusoidal term is often sufficient to describe seasonal temperature variations, making the model both interpretable and computationally efficient.
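To illustrate the partial-Fourier point on synthetic data (not the actual temperature series): a phase-shifted sine \(\alpha \sin(\omega t + \theta)\) is algebraically equivalent to a sine-plus-cosine pair, so a single sine term with a free phase, as in the nls fit used later, loses nothing, whereas a fixed-phase sine alone can fit much worse.

```r
# Sketch on simulated seasonal data: a*sin(omega*t + theta) expands to
# a*cos(theta)*sin(omega*t) + a*sin(theta)*cos(omega*t), so the sine+cosine
# pair matches the free-phase sine exactly; forcing theta = 0 loses fit.
set.seed(1)
t     <- 1:3650
omega <- 2 * pi / 365.25
y     <- 13 + 10 * sin(omega * t + 2) + rnorm(length(t), sd = 2)

fit_full <- lm(y ~ sin(omega * t) + cos(omega * t))  # equivalent to free phase
fit_sin  <- lm(y ~ sin(omega * t))                   # phase forced to zero

c(full = summary(fit_full)$r.squared, sine_only = summary(fit_sin)$r.squared)
```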
Before fitting, I applied a denoising step with a Gaussian convolution filter; reducing the noise also improves the Fourier-style fit by making the dominant seasonal cycle in the temperature data easier to identify.
Gaussian Convolution Filter:
A convolution is a mathematical operation used to blend or smooth data. It involves applying a kernel (a small array of weights) across the data to compute a weighted average. The Gaussian kernel is a specific type of convolution filter that assigns weights based on the Gaussian (normal) distribution, which is a bell-shaped curve.
The Gaussian kernel is widely used for smoothing because it gives more weight to the central points (closer to the middle) while gradually reducing the weights for points further away. This is mathematically expressed as:
\[ G(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\cdot e^{-\frac{x^2}{2\sigma^2}} \]
Moving Window: The Gaussian kernel slides across each point in the time series. For each position, it computes a weighted sum where the weights are defined by the Gaussian kernel.
Local Averaging: The computed value represents a smoothed average of the points within the window’s range. Points closer to the center of the window contribute more heavily to the average than those farther away, as per the Gaussian weights.
Symmetric Filtering: using 2 sides makes this operation symmetric, applying the kernel both forward and backward across the data, ensuring the filtering effect is centered.
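A minimal sketch of this moving-window smoothing, assuming a normalized kernel (note that `dnorm(-3:3)`, as used in the code below, sums to about 0.9997 rather than exactly 1; normalizing keeps the smoothed series on the same level as the input):

```r
# Minimal sketch of symmetric Gaussian smoothing with a normalized kernel.
gaussian_kernel <- function(half_width, sigma = 1) {
  k <- dnorm(seq(-half_width, half_width), sd = sigma)
  k / sum(k)  # weights sum to exactly 1
}

k  <- gaussian_kernel(3)
x  <- rep(5, 20)                     # a constant series...
sm <- stats::filter(x, k, sides = 2) # ...convolved symmetrically...
sm[4:17]                             # ...passes through unchanged away from the edges
```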
Conclusion and data inspection: in visual terms the denoised data should look smoother with slightly lower peaks
Show code
temps <- DATASET
apply_convolution <- function(x, kernel) {
# Use filter() from stats package to apply convolution
filtered <- stats::filter(x, kernel, sides = 2) # Use sides = 2 for symmetric filter
return(filtered)
}
# DENOISED - convolution with a window of 90 days
kernel <- dnorm(-3:3)
data.frame("Gaussian_Kernel" = round(kernel, 10))

| Gaussian_Kernel |
|---|
| 0.0044318 |
| 0.0539910 |
| 0.2419707 |
| 0.3989423 |
| 0.2419707 |
| 0.0539910 |
| 0.0044318 |
Show code
temps$Denoised <- apply_convolution(temps$T_AVG, kernel)
temps$Denoised <- na.fill(temps$Denoised, mean(temps$Denoised, na.rm = TRUE))
temps$Trend <- SMA(temps$Denoised, n = lookback)
library(nlme, quietly = T, warn.conflicts = F)
# Define the model
sin_component <- function(t, a, b, alpha, theta) {
omega <- 2 * pi / 365.25
a + b * t + alpha * sin(omega * t + theta)
}
omega <- 2 * pi / 365.25
# Fit model using non linear squares
temps$NUM_DAY <- 1:nrow(temps)
fit <- nls(Denoised ~ sin_component(NUM_DAY, a, b, alpha, theta),
data = temps,
start = list(a = 1, b = 0, alpha = 1, theta = 0))
# Get coefficients and confidence intervals for the model
params <- coef(fit)
confint_fit <- confint(fit)
Waiting for profiling to be done...
Show code
temps$SEAS <- params["alpha"] * sin(omega * temps$NUM_DAY + params["theta"])
temps$TREND <- params["a"] + params["b"] * temps$NUM_DAY
temps$BAR <- temps$TREND + temps$SEAS
temps$RESID <- temps$T_AVG - temps$TREND - temps$SEAS

Just to be certain, I checked whether the fitted values and residuals from nls match my hand-built \(\overline T\) and \(Residuals\), by measuring the percentage of equal values as the number of rounding decimals increases. The logic is that if two data sets come from the same underlying fit (like two different models' results for the same data), they should stay identical up to many decimal places before diverging.
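The loaded `check_acc` helper is not shown in this document; a hypothetical minimal version of the rounding-comparison idea might look like this (`match_share` is an illustrative name, not the real implementation):

```r
# Hypothetical sketch of the rounding-comparison idea behind check_acc:
# the share of values that still agree after rounding to `digits` decimals.
match_share <- function(a, b, digits) {
  mean(round(a, digits) == round(b, digits))
}

a <- c(1.23456, 2.34567)
b <- a + 1e-4          # identical until roughly the 4th decimal
match_share(a, b, 3)   # 1: all values still agree at 3 decimals
match_share(a, b, 5)   # 0: every value differs at 5 decimals
```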
Show code
cat("T_BAR VS fitted from the non linear squares \n")
T_BAR VS fitted from the non linear squares
Show code
check_acc(temps$BAR, fitted(fit), 15)
Show code
cat("RESID VS residuals from the non linear squares \n")
RESID VS residuals from the non linear squares
Show code
check_acc(temps$RESID, residuals(fit), 15)
3.1 Model performance
To assess model performance I look at the estimated parameter values with their confidence intervals, plus the residual sum of squares (RSS) and the mean absolute error (MAE).
The RSS function calculates the Residual Sum of Squares, which is a measure of the discrepancy between the observed data and the model’s predictions. A lower RSS indicates a better fit of the model to the data, meaning that the model’s predictions are closer to the actual observed values.
The MAE function is the average of the absolute differences between the observed and predicted values. It gives an idea of how far, on average, the predictions are from the actual values, without considering the direction of the error. A lower MAE indicates that the model’s predictions are generally close to the observed data.
a : 12.32 CI ~normally [ 12.241 , 12.399 ]
b : 0 CI ~normally [ 0 , 0 ]
alpha : -10.082 CI ~normally [ -10.138 , -10.026 ]
theta : -17.61 CI ~normally [ -17.615 , -17.604 ]
RSS model sine curve: 126124.7
MAE model fit: 2.25
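For reference, minimal versions of the two metrics reported above (the `RSS` helper loaded from main may differ in detail):

```r
# Minimal sketches of the residual sum of squares and mean absolute error.
rss <- function(observed, predicted) sum((observed - predicted)^2)
mae <- function(observed, predicted) mean(abs(observed - predicted))

obs  <- c(10, 12, 15)
pred <- c(11, 12, 13)
rss(obs, pred)  # (-1)^2 + 0^2 + 2^2 = 5
mae(obs, pred)  # (1 + 0 + 2) / 3 = 1
```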
3.2 Visualization of results
Show code
# plot denoised
ggplot(tail(temps, lookback), aes(x=DAY))+
geom_point(aes(y = T_AVG, color = "Average"), size = 1)+
geom_point(aes(y = Denoised, color = "Denoised"), size = 1)+
scale_color_manual(name=NULL, values = c(Average = "royalblue", Denoised = "#ffcc00"))+
labs(title = "Average temperature", x = "Date", y = "Temperature",
       subtitle = "Before and after the gaussian convolution filter")
Show code
temps_xts <- temps %>%
select(T_AVG, Denoised, TREND, SEAS, RESID) %>%
xts(order.by = temps$DAY) %>%
tail(lookback)
# Plot seasonal decomposition from Avg to residuals
grid.arrange(nrow=5, top = paste0("Classical decomposition - last ", lookback/365, " years"),
temps_xts$T_AVG %>%
quickplot(subtitle = "Average Temperature", show_legend = F, xlab = NULL, ylab = "Temps"),
temps_xts$Denoised %>%
quickplot(subtitle = "Denoised", show_legend = F, xlab = NULL, ylab = "Temps"),
temps_xts$TREND %>%
quickplot(subtitle = "Trend", show_legend = F, xlab = NULL, ylab = "Temps"),
temps_xts$SEAS %>%
quickplot(subtitle = "Seasonal", show_legend = F, xlab = NULL, ylab = "Temps"),
temps_xts$RESID %>%
               quickplot(subtitle = "Residuals", show_legend = F, xlab = NULL, ylab = "Temps"))
Show code
# Plot original vs. fitted data
ggplot(temps, aes(x = DAY)) +
geom_point(aes(y = T_AVG), color = 'royalblue', size = 0.5) +
geom_line(aes(y = BAR), color = 'orange', linewidth=2) +
  labs(title = "Temperature Model Fit (all Observations)", y = "Temperature (deg C)")

3.2.1 Check for possible model degradation
I check for model degradation over time by comparing the first and last 10 years, looking for a meaningful shift between the fitted sine curve and the average temperature.
Show code
grid.arrange(nrow = 2, ncol = 2,
ggplot(temps %>% head(lookback), aes(x = DAY)) +
geom_point(aes(y = T_AVG), color = 'royalblue', size = 0.5) +
geom_line(aes(y = BAR), color = 'orange', linewidth=2) +
labs(title = paste0("Temperature Model Fit (First ", lookback/365, " years)"),x=NULL, y = NULL),
ggplot(temps %>% head(lookback), aes(x = DAY)) +
geom_line(aes(y = RESID), color = 'black', linewidth=0.5) +
labs(title = paste0("Residuals (First ", lookback/365, " years)"),x=NULL, y = NULL),
ggplot(temps %>% tail(lookback), aes(x = DAY)) +
geom_point(aes(y = T_AVG), color = 'royalblue', size = 0.5) +
geom_line(aes(y = BAR), color = 'orange', linewidth=2) +
labs(title = paste0("Temperature Model Fit (Last ", lookback/365, " years)"), x=NULL, y = NULL),
ggplot(temps %>% tail(lookback), aes(x = DAY)) +
geom_line(aes(y = RESID), color = 'black', linewidth=0.5) +
labs(title = paste0("Residuals (Last ", lookback/365, " years)"),x=NULL, y = NULL)
)3.3 Residuals analysis and diagnostics
Autocorrelation Function (ACF) Plot of Residuals:
The ACF plot checks for any correlation between residuals at different lags. If residuals are uncorrelated (i.e., resemble white noise), all values should fall within the confidence bands, indicating no significant autocorrelation.
Partial Autocorrelation Function (PACF) Plot of Residuals:
The PACF plot displays the partial correlation of residuals with their own lagged values, considering the effect of any intermediate lags. Like the ACF plot, if the model fits well, the PACF values should also fall within the confidence bands, showing no significant partial autocorrelations.
Normality Check using QQ Plot:
The QQ plot assesses whether the residuals are normally distributed. Points should lie along the reference line; significant deviations from this line suggest that the residuals do not follow a normal distribution, which could imply model misspecification or the presence of outliers.
Histogram of Residuals with Normal Curve:
The histogram visualizes the distribution of residuals, and the overlaid normal curve helps assess normality. If the histogram closely follows the bell-shaped curve, it suggests normal residuals. The vertical line representing kurtosis indicates whether the residuals are more or less peaked than a normal distribution.
Residuals vs. Fitted Values Plot:
The plot checks for patterns in the residuals against fitted values. Ideally, the residuals should be randomly scattered around zero, indicating homoscedasticity (constant variance). Any visible pattern or trend suggests issues like non-linearity, heteroscedasticity, or omitted variables.
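As a complement to these visual checks (not part of the original analysis), the Ljung-Box test gives a formal p-value for residual autocorrelation. The sketch below uses simulated series rather than the model residuals:

```r
# Ljung-Box test: the null hypothesis is "no autocorrelation up to the
# chosen lag". A well-behaved residual series should fail to reject it.
set.seed(42)
white_noise <- rnorm(500)                          # should not reject
ar_series   <- arima.sim(list(ar = 0.8), n = 500)  # should reject strongly

p_wn <- Box.test(white_noise, lag = 20, type = "Ljung-Box")$p.value
p_ar <- Box.test(ar_series,   lag = 20, type = "Ljung-Box")$p.value
c(white_noise = p_wn, ar1 = p_ar)  # p_ar is essentially zero
```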
Show code
grid.arrange(nrow = 2,
# ACF
ggAcf(temps$RESID, lag.max = 100)+
labs(title = "ACF of Residuals", x = NULL, y = NULL),
# PACF
ggPacf(temps$RESID, lag.max = 100)+
labs(title = "PACF of Residuals", x = NULL, y = NULL),
## Check normality of residuals using QQ plot
ggplot(temps, aes(sample = RESID)) +
stat_qq(color="royalblue")+
stat_qq_line(color = "black", linewidth = 0.4)+
labs(title = "QQ plot", x="Theoretical Quantiles", y= "Observed Quantiles"),
## Check for heteroskedasticity or any pattern in residuals
ggplot(temps %>% tail(lookback), aes(x=BAR, y = RESID))+
geom_point(size = 0.4)+
geom_smooth(method = "lm")+
    labs(title = paste0("Residuals vs Fitted - last ", lookback/365, " years"), x = "Fitted Values", y = "Residuals"))
Show code
# Histogram with bell curve and kurtosis
ggplot(temps, aes(x = RESID)) +
geom_histogram(aes(y = after_stat(density)), fill = "lightblue", bins = 30) +
stat_function(fun = dnorm, args = list(mean = mean(temps$RESID), sd = sd(temps$RESID)),
color = "red", linewidth = 1.2) +
# geom_vline(xintercept = skewness(temps$RESID)[1], linetype = "dashed", linewidth=1, color = "darkred")+
  labs(title = "Histogram of Residuals with Normal Curve", x = "Residuals", y = "Density")

4 Ornstein-Uhlenbeck (OU) process
To model temperature dynamics I use a mean-reverting stochastic process, the Ornstein-Uhlenbeck (OU) process, and incorporate the seasonal adjustment into the drift so that the expected value of the process follows the seasonal mean temperature.
The Ornstein-Uhlenbeck (OU) process is a type of stochastic (random) process that is often used in physics, finance, and other fields to model systems that exhibit some form of “mean-reverting” behavior. Let’s start from the basics to understand what this means.
A stochastic process is a collection of random variables representing the evolution of a system over time. In simpler terms, it describes how something changes when randomness is involved. For example, the price of a stock or the position of a particle under the influence of random forces can be modeled as a stochastic process.
\[ dX_t = \kappa (\mu - X_t)\, dt + \sigma \cdot dW_t \]
- \(X_t\) is the value of the process at time t.
- \(\mu\) is the long-term mean or equilibrium level toward which the process reverts.
- \(\kappa>0\) is the rate of mean reversion, determining how fast the process reverts to the mean \(\mu\).
- \(\sigma >0\) is the volatility or the intensity of the random fluctuations.
- \(dW_t\) is an infinitesimal increment of a Wiener process (Brownian motion).
Mean Reversion Term \(\kappa (\mu - X_t)\, dt\): represents the tendency of the process to revert to the mean \(\mu\). If \(X_t\) is above the mean \(\mu\), this term is negative (pulling it down toward the mean). If \(X_t\) is below \(\mu\), the term is positive (pushing it up toward the mean).
The speed of this pull or push is determined by \(\kappa\). A higher \(\kappa\) means the process will revert to the mean faster.
Random Fluctuation Term \(\sigma \cdot dW_t\) : represents the random noise or shock. This is a source of randomness, and it causes the process to deviate randomly over time.
The size of the noise is controlled by \(\sigma\), which is the volatility parameter.
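Before the seasonal version used in this section, here is a stripped-down Euler-Maruyama simulation of a constant-mean OU process, just to show the mean-reversion mechanics in isolation. All parameter values are illustrative, not estimated from the data.

```r
# Euler-Maruyama discretization of dX = kappa*(mu - X)*dt + sigma*dW
# with a constant mean, to isolate the mean-reversion behaviour.
set.seed(7)
n     <- 1000
mu    <- 15    # long-term mean
kappa <- 0.15  # mean-reversion speed
sigma <- 0.5   # volatility
dt    <- 1     # daily step

x    <- numeric(n)
x[1] <- 30     # start far above the mean
for (i in 2:n) {
  x[i] <- x[i - 1] + kappa * (mu - x[i - 1]) * dt + sigma * sqrt(dt) * rnorm(1)
}

mean(tail(x, 500))  # the path decays toward mu and then fluctuates around it
```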
Show code
temps_OU <- temps
# Define parameters for the OU process
kappa <- 1-arima(temps_OU$RESID, order = c(1,0,0))$coef[1] # Mean-reversion rate
sigma <- 0.1 # Volatility of the process
dt <- 1 # Time step (daily data)
cat("Kappa is estimated as:", round(kappa, 4))
Kappa is estimated as: 0.1515
Show code
# Initialize variables for simulation
n <- nrow(temps_OU) # Number of time points
T_simulated <- numeric(n) # Simulated temperature vector
T_simulated[1] <- temps_OU$Denoised[1] # Set initial temperature to the first observed value
# Simulate the seasonal mean as a time-varying mean (trend + seasonal component)
T_bar <- temps_OU$BAR
# Simulate the modified OU process
for (i in 2:n) {
# Rate of change of the seasonal mean
dT_bar_dt <- (T_bar[i] - T_bar[i - 1]) / dt
# Brownian motion increment
dWt <- rnorm(1, mean = 0, sd = sqrt(dt))
T_simulated[i] <- T_simulated[i - 1] +
(dT_bar_dt + kappa * (T_bar[i] - T_simulated[i - 1])) * dt +
sigma * dWt
}
temps_OU$OU <- T_simulated
DATE <- tail(temps$DAY, lookback)
ggplotly(
temps_OU %>%
select(Denoised, OU, BAR) %>%
tail(lookback) %>%
round(4) %>%
ggplot(aes(x = DATE)) +
geom_point(aes(y = Denoised, color = "Observed"), size = 0.5) +
geom_line(aes(y = OU, color = "Simulated"), linewidth = 0.8) +
geom_line(aes(y = BAR, color = "Model fit"), linewidth = 0.5) +
scale_color_manual(name = NULL, values = c(Observed = "blue", Simulated = "darkgreen", "Model fit" = "red")) +
labs(title = "Simulated Ornstein-Uhlenbeck Process for Temperature - Last 10 years", x = "Day", y = "Temperature")
)

5 Sidequests
5.1 Sidequest: AR modeling
Fit an AR(1) model if there is significant autocorrelation
Show code
ar1_model <- arima(temps$RESID, order = c(1,0,0))
summary(ar1_model)
Call:
arima(x = temps$RESID, order = c(1, 0, 0))
Coefficients:
ar1 intercept
0.8485 0.0010
s.e. 0.0042 0.0769
sigma^2 estimated as 2.189: log likelihood = -29206.38, aic = 58418.76
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -2.00335e-05 1.479507 1.149487 176.4712 368.9013 0.9715242
ACF1
Training set 0.1019729
Show code
ar1_fitted <- fitted(ar1_model)
# arma[1] is the AR order; length(ar1_model$coef) would also count the intercept
quickplot(ar1_fitted, title = paste0("AR(", ar1_model$arma[1], ") Model fitted"))

5.2 Sidequest: overlap of the dataset
One thing I will need later is the standard deviation across the days of the year. A pivot table is used for this step: it lines up the whole dataset day by day across years, which also makes it possible to analyze seasonal patterns at a glance.
Show code
pivot_df <- DATASET %>%
select(DOY, YEAR, T_AVG) %>%
pivot_wider(names_from = YEAR, values_from = T_AVG)
MAX_pivot_df <- DATASET %>%
select(DOY, YEAR, T_MAX) %>%
pivot_wider(names_from = YEAR, values_from = T_MAX)
MIN_pivot_df <- DATASET %>%
select(DOY, YEAR, T_MIN) %>%
pivot_wider(names_from = YEAR, values_from = T_MIN)
MEAN <- apply(pivot_df[-1], 1, mean, na.rm=TRUE)
MEDIAN <- apply(pivot_df[-1], 1, median, na.rm=TRUE)
IQR <- apply(pivot_df[-1], 1, IQR, na.rm=TRUE)
SD <- apply(pivot_df[-1], 1, sd, na.rm=TRUE)
data.frame(MEAN = MEAN,
MEDIAN = MEDIAN) %>%
  quickplot(title = "Mean and median across the year", xlab = "Day of the year", ylab = "Temperature")
Show code
data.frame(IQR = SMA(IQR, n = 7),
SD = SD) %>%
quickplot(title = "Standard deviation and Inter-quartile-range across the year",
            subtitle = "IQR MA(7)", xlab = "Day of the year", ylab = "Temperature")
Show code
data.frame(MAX = apply(pivot_df[-1], 1, max, na.rm=TRUE),
MIN = apply(pivot_df[-1], 1, min, na.rm=TRUE)) %>%
mutate(AVG = (MAX + MIN)/2) %>%
mutate(RANGE = MAX - MIN) %>%
  quickplot(title = "Range across the year", show_legend = T, xlab = "Day of the year", ylab = "Temperature")
Show code
melt(pivot_df[-1], id.vars = NULL) %>%
ggplot(aes(x = variable, y = value)) +
geom_boxplot(fill = "gray") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Boxplot for all years", x = NULL, y = NULL)
Show code
DATASET %>%
select(Month, T_AVG) %>%
group_by(Month) %>%
ggplot(aes(x = Month, y = T_AVG, group = Month)) +
geom_boxplot(fill = "gray") +
scale_x_continuous(breaks = seq(1, 12, by = 1), labels = month.abb)+
  labs(title = "Boxplot for all months across the years", x = NULL, y = NULL)

5.2.1 How is this year compared to the rest
This dataset spans 43 years, with some extreme hot and cold points along the way, so I built a ribbon of all the daily maxima and minima to see whether this year has been unusually hot relative to the averages. While this is in no way conclusive evidence, it at least gives a glimpse of how the climate is changing, especially around the middle of the year.
Show code
ggplotly(
ggplot(DATASET, aes(x = DOY, y = T_AVG, group = YEAR, color = YEAR)) +
geom_line() +
labs(title = "All years overlapped", x = "Day of the year",
         y = "Temperature"))
Show code
THISYEAR <- data.frame(c(DATASET[DATASET$YEAR == 2024,]["T_MAX"]),
c(DATASET[DATASET$YEAR == 2024,]["T_AVG"]),
c(DATASET[DATASET$YEAR == 2024,]["T_MIN"]))
ggplotly(cbind(pivot_df,
"MIN" = apply(pivot_df[-1], 1, min, na.rm = TRUE),
"MAX" = apply(pivot_df[-1], 1, max, na.rm = TRUE),
"MEAN" = apply(pivot_df[-1], 1, mean, na.rm = TRUE)) %>%
round(2) %>%
ggplot(aes(x = DOY)) +
geom_ribbon(aes(ymin = MIN, ymax = MAX), alpha = 0.2) +
geom_line(aes(y = MEAN, color = "MEAN"), linewidth = 1) +
geom_line(aes(y = pivot_df$'2024', color = "AVG_Current"), linewidth = 1.3) +
geom_line(aes(y = MAX_pivot_df$'2024', color = "MAX_Current"), linewidth = 0.6, alpha = 1) +
geom_line(aes(y = MIN_pivot_df$'2024', color = "MIN_Current"), linewidth = 0.6, alpha = 1) +
scale_color_manual(name = "Years", values = c(MEAN = "orange2", AVG_Current = "olivedrab",
MIN_Current = "lightblue", MAX_Current = "indianred")) +
labs(title = "Is this year hotter on average?",
y = NULL,
x = NULL)
)5.3 Sidequest: check accuracy
Show code
T_SEAS <- params["alpha"] * sin(omega * temps$NUM_DAY + params["theta"])
T_TREND <- params["a"] + params["b"] * temps$NUM_DAY
T_BAR <- T_TREND + T_SEAS
data.frame(
Trend = T_TREND,
Seasonal = T_SEAS,
myfit = T_BAR,
oldfit = fitted(fit)) %>%
  quickplot(plot_engine = "plotly")
Show code
cat("MODEL VS OU\n")
MODEL VS OU
Show code
check_acc(fitted(fit), temps_OU$OU, 10)
Show code
cat("BAR VS OU\n")
BAR VS OU
Show code
check_acc(temps$BAR, temps_OU$OU, 10)
5.4 B-splines
Show code
library(splines)
# Define the number of knots
knots <- c(1, 3, 5, 10, 15, 20, 30, 50, 80)
# Function to create a spline model and plot
create_spline_plot <- function(knots, x, y) {
# Fit the spline model
spline_model <- lm(y ~ bs(x, knots = knots))
yfit <- predict(spline_model, data.frame(x = x))
# Calculate Residual Sum of Squares (RSS)
rss <- sum((y - yfit)^2)
# Create the plot
plots <- ggplot() +
geom_point(aes(x, y), color = 'cornflowerblue', size = 1.5) +
geom_line(aes(x, yfit), color = 'black', linewidth = 1) +
ggtitle(paste("Knots #", knots, "\nRSS:", round(rss, 2))) +
labs(y = "Temps", x = NULL) +
theme_classic()+
theme(plot.title = element_text(size = 10, face = "bold"))
return(plots)
}
# Generate all the plots
plots <- lapply(knots, create_spline_plot, x = 1:366, y = SD)
# Arrange and display the plots in a 2x3 grid
grid.arrange(grobs = plots, nrow = 3, ncol = 3)